16-cs02-eda

Professor Shannon Ellis

3/7/23

CS02: Vaping Behaviors in American Youth (EDA)

Course Announcements

Due Dates:

  • Lecture Participation survey “due” after class
  • Lab08 Due Friday (CS02 EDA)
  • CS02 due Monday of Finals week (3/20)
  • Final Project due Th of Finals week (3/23)

Notes:

  • cs01 group work form is still open - please complete (if you haven’t)
  • cs02 and final project groups/repos assigned
  • don’t wait until finals week to do cs02 and final project

Questions

  1. How has tobacco and e-cigarette/vaping use by American youths changed since 2015?
  2. How does e-cigarette use compare between males and females?
  3. What vaping brands and flavors appear to be used the most frequently?
  4. Is there a relationship between e-cigarette/vaping use and other tobacco use?

Data

…will only work if you finished the last set of notes (or if you open this week’s lab).

load("data/wrangled/wrangled_data_vaping.rda")

Question

When is pivoting your data from wide to long (or long to wide) helpful?

data |>
   pivot_longer(cols = columns_to_pivot , names_to = "new_col_for_labels" , values_to = "new_col_for_values")
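As a concrete version of the template above, here is a minimal sketch on a made-up wide table. The data frame `wide`, its column names, and its values are all hypothetical and are not taken from the NYTS data.

```r
library(tidyr)

# Toy wide table: one column per product (values are made up for illustration)
wide <- data.frame(
  year      = c(2015, 2016),
  ecig      = c(0.10, 0.12),
  cigarette = c(0.08, 0.07)
)

# Long format: one row per year-product pair
long <- wide |>
  pivot_longer(cols = c(ecig, cigarette),
               names_to = "product",
               values_to = "proportion")
long
```

The long version tends to be handy when a plot needs the product to be a single variable, for example to map it to color or linetype.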

EDA: Brainstorm

What do you want to know?

Notes from class:

  • count:
    • by age groups - age distribution
    • Grade
    • Group
    • Sex: Males, Females, NA
  • Relative proportion of e-cig use by gender (anecdote: spent cartridges turned up in the boys’ bathroom but not the girls’ in high school; hypothesis: use is higher among males than females, but want to check)
  • Distribution of most popular brands (all brands)
  • Distribution of most popular flavors (all flavors)
  • year on the x-axis; y is frequency of use; some measurement of use (e-cig, tobacco, …)

EDA: skim

library(skimr)
skim(nyts_data)
Data summary
Name nyts_data
Number of rows 95465
Number of columns 59
_______________________
Column type frequency:
character 7
logical 43
numeric 9
________________________
Group variables None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
psu                    0           1.00    5    6      0       431           0
stratum                0           1.00    3    3      0        16           0
Age                  417           1.00    1    3      0        11           0
Sex                  778           0.99    4    6      0         2           0
Grade                477           1.00    1   14      0         8           0
brand_ecig         91861           0.04    3    7      0         7           0
Group                  0           1.00    7   23      0         4           0

Variable type: logical

skim_variable          n_missing  complete_rate  mean  count
ECIGT                       1470           0.98  0.18  FAL: 76819, TRU: 17176
ECIGAR                      1757           0.98  0.15  FAL: 79724, TRU: 13984
ESLT                        1862           0.98  0.07  FAL: 87041, TRU: 6562
EELCIGT                     1715           0.98  0.26  FAL: 69417, TRU: 24333
EROLLCIGTS                  2929           0.97  0.05  FAL: 88100, TRU: 4436
EFLAVCIGTS                 78469           0.18  0.05  FAL: 16126, TRU: 870
EBIDIS                      2940           0.97  0.01  FAL: 91369, TRU: 1156
EFLAVCIGAR                 58568           0.39  0.09  FAL: 33530, TRU: 3367
EHOOKAH                     2388           0.97  0.09  FAL: 84515, TRU: 8562
EPIPE                       2941           0.97  0.02  FAL: 90348, TRU: 2176
ESNUS                       2941           0.97  0.03  FAL: 89379, TRU: 3145
EDISSOLV                    2939           0.97  0.01  FAL: 91423, TRU: 1103
CCIGT                       1840           0.98  0.05  FAL: 88747, TRU: 4878
CCIGAR                      2019           0.98  0.05  FAL: 88423, TRU: 5023
CSLT                        2173           0.98  0.03  FAL: 90496, TRU: 2796
CELCIGT                     1505           0.98  0.12  FAL: 82799, TRU: 11161
CROLLCIGTS                  3049           0.97  0.02  FAL: 90440, TRU: 1976
CFLAVCIGTS                 78521           0.18  0.02  FAL: 16529, TRU: 415
CBIDIS                      3038           0.97  0.01  FAL: 91956, TRU: 471
CHOOKAH                     2666           0.97  0.03  FAL: 89657, TRU: 3142
CPIPE                       3061           0.97  0.01  FAL: 91603, TRU: 801
CSNUS                       3053           0.97  0.01  FAL: 91198, TRU: 1214
CDISSOLV                    3050           0.97  0.01  FAL: 91938, TRU: 477
menthol                    17711           0.81  0.06  FAL: 73305, TRU: 4449
clove_spice                17711           0.81  0.01  FAL: 77360, TRU: 394
fruit                      17711           0.81  0.07  FAL: 71945, TRU: 5809
chocolate                  17711           0.81  0.01  FAL: 76875, TRU: 879
alcoholic_drink            17711           0.81  0.02  FAL: 76510, TRU: 1244
candy_dessert_sweets       17711           0.81  0.05  FAL: 74188, TRU: 3566
other                      17711           0.81  0.03  FAL: 75675, TRU: 2079
EHTP                       78434           0.18  0.02  FAL: 16633, TRU: 398
CHTP                       76592           0.20  0.02  FAL: 18582, TRU: 291
tobacco_ever                   0           1.00  0.35  FAL: 61793, TRU: 33672
tobacco_current                0           1.00  0.18  FAL: 78757, TRU: 16708
ecig_ever                      0           1.00  0.25  FAL: 71132, TRU: 24333
ecig_current                   0           1.00  0.12  FAL: 84304, TRU: 11161
non_ecig_ever                  0           1.00  0.27  FAL: 69674, TRU: 25791
non_ecig_current               0           1.00  0.12  FAL: 84419, TRU: 11046
ecig_only_ever                 0           1.00  0.05  FAL: 90308, TRU: 5157
ecig_only_current              0           1.00  0.03  FAL: 92756, TRU: 2709
non_ecig_only_ever             0           1.00  0.07  FAL: 89147, TRU: 6318
non_ecig_only_current          0           1.00  0.03  FAL: 92439, TRU: 3026
no_use                         0           1.00  0.65  TRU: 61738, FAL: 33727

Variable type: numeric

skim_variable         n_missing  complete_rate      mean       sd        p0       p25       p50       p75      p100  hist
year                          0              1   2017.02     1.40   2015.00   2016.00   2017.00   2018.00   2019.00  ▇▇▇▇▇
finwgt                        0              1   1421.44  1093.13     11.15    708.52   1131.48   1754.38   6505.08  ▇▅▁▁▁
tobacco_sum_ever              0              1      0.91     1.68      0.00      0.00      0.00      1.00     12.00  ▇▁▁▁▁
tobacco_sum_current           0              1      0.34     0.97      0.00      0.00      0.00      0.00     11.00  ▇▁▁▁▁
ecig_sum_ever                 0              1      0.25     0.44      0.00      0.00      0.00      1.00      1.00  ▇▁▁▁▃
ecig_sum_current              0              1      0.12     0.32      0.00      0.00      0.00      0.00      1.00  ▇▁▁▁▁
non_ecig_sum_ever             0              1      0.66     1.41      0.00      0.00      0.00      1.00     11.00  ▇▁▁▁▁
non_ecig_sum_current          0              1      0.23     0.78      0.00      0.00      0.00      0.00     10.00  ▇▁▁▁▁
n                             0              1  19167.48  1193.77  17711.00  17872.00  19018.00  20189.00  20675.00  ▇▁▃▁▇

Note: If you include this in a report, you should also guide the viewer. There’s a lot in there. What do you want your reader to know?

Things discussed in class when looking at skim output:

  • why brand_ecig has so many missing values (only have brand information for 2019)
  • what the mean of a logical variable represents (the proportion of TRUE values)
  • why the Age min and max values look off (Age is coded as a character because of the >18 category, so min and max are string lengths rather than ages)
  • that the n column is the number of respondents in each year
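Two of the points above can be checked on toy vectors. This is only a sketch: `used` and `age` are made-up values, not columns from `nyts_data`.

```r
# A logical vector: TRUE counts as 1 and FALSE as 0, so the mean is a proportion
used <- c(TRUE, FALSE, FALSE, TRUE, NA)
prop_true <- mean(used, na.rm = TRUE)  # 2 TRUE out of 4 non-missing values
prop_true

# Ages stored as text: for character columns, skim() reports the shortest and
# longest string lengths under min and max, not the youngest and oldest ages
age <- c("9", "10", "17", ">18")
len_range <- range(nchar(age))
len_range
```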

Remember: It can be very helpful to think “what overall trends do I see?” and “is there anything weird going on?” any time you’re looking at EDA output

EDA: Basics

Getting a sense for some of the categorical data

table(nyts_data$Age)

  >18    10    11    12    13    14    15    16    17    18     9 
  727    50  5360 13499 14613 14036 13498 13205 12754  7108   198 
table(nyts_data$Group)

Combination of products                 Neither       Only e-cigarettes 
                  16517                   61738                    7866 
    Only other products 
                   9344 
table(nyts_data$year)

 2015  2016  2017  2018  2019 
17711 20675 17872 20189 19018 

Remember: If you include this in a report, you’ll also need text to explain what the reader should know/take away from any of these.
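In the Age table above, the categories come out in alphabetical order (">18" first, "9" last) because Age is stored as text. One possible fix, sketched here on a made-up vector rather than on `nyts_data`, is to convert to a factor with the levels spelled out in the order you want.

```r
# Toy ages stored as text, including the ">18" category (made-up values)
age <- c("9", "10", "11", ">18", "17", "10", "9")

# Spell out the desired order instead of relying on alphabetical sorting
age_levels <- c(as.character(9:18), ">18")
age_ordered <- factor(age, levels = age_levels)

# table() on a factor follows the level order (and shows empty levels as 0)
age_counts <- table(age_ordered)
age_counts
```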

EDA: Plots

Sex by Year

nyts_data |>
  group_by(year) |>
  count(Sex) |> 
  ggplot(aes(x=year, y=n, color=Sex)) + 
  geom_col()

(Note: I argued that this was not the best way to display these data…but it’s a good start. If you include in your report, you likely want to improve!)
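One possible improvement, offered as a sketch rather than as the version argued for in class: map Sex to `fill` (with `geom_col()`, `color` only changes the bar outline) and place the bars side by side. The tibble `toy` is a made-up stand-in for `nyts_data`.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for nyts_data (made-up rows; only year and Sex are needed)
toy <- tibble(
  year = c(2015, 2015, 2015, 2016, 2016, 2016),
  Sex  = c("Male", "Female", "Female", "Male", "Male", NA)
)

# count() with two variables replaces group_by(year) |> count(Sex)
sex_counts <- toy |> count(year, Sex)

p <- ggplot(sex_counts, aes(x = year, y = n, fill = Sex)) +
  geom_col(position = "dodge") +   # side-by-side bars instead of stacked
  labs(x = "Survey year", y = "Number of respondents")
p
```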

E-Cig use by gender

Student suggestion: Relative proportion of e-cig use by gender (anecdote: spent cartridges turned up in the boys’ bathroom but not the girls’ in high school; hypothesis: use is higher among males than females, but want to check)

  • barplot between two genders; proportion of each (gender on x axis - proportion of use on y axis); doing the same across different tobacco use categories
  • want to capture year in some way (facet)
  • line plot (year on x-axis; measurement of use on y-axis (proportion), lines for Sex)
nyts_data |>
  group_by(year, Sex) |>
  # proportion who have ever used e-cigs (the mean of a logical is the share of TRUE)
  summarize(mean_ever = mean(ecig_ever, na.rm=TRUE)) |>
  ggplot(aes(x=year, y=mean_ever, group=Sex, color=Sex)) +
  geom_line()
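The bar plot idea from the list above (Sex on the x-axis, proportion on the y-axis, year captured by faceting) could look something like the following. This is a sketch on a made-up tibble `toy`, not on `nyts_data`, and the axis labels are just suggestions.

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for nyts_data (made-up rows)
toy <- tibble(
  year      = rep(c(2015, 2016), each = 4),
  Sex       = rep(c("Male", "Male", "Female", "Female"), times = 2),
  ecig_ever = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)
)

# Proportion of ever-use within each year and Sex
prop_by_sex <- toy |>
  group_by(year, Sex) |>
  summarize(prop_ever = mean(ecig_ever, na.rm = TRUE), .groups = "drop")

# One panel per year, one bar per Sex
p <- ggplot(prop_by_sex, aes(x = Sex, y = prop_ever)) +
  geom_col() +
  facet_wrap(~ year) +
  labs(x = "Sex", y = "Proportion who have ever used e-cigarettes")
p
```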

Brands

Student suggestion: Distribution of most popular brands (all brands)

nyts_data |> 
  group_by(year) |>
  count(brand_ecig)
# A tibble: 12 × 3
# Groups:   year [5]
    year brand_ecig     n
   <dbl> <chr>      <int>
 1  2015 <NA>       17711
 2  2016 <NA>       20675
 3  2017 <NA>       17872
 4  2018 <NA>       20189
 5  2019 Blu          111
 6  2019 JUUL        2028
 7  2019 Logic         36
 8  2019 MarkTen       32
 9  2019 NJOY          44
10  2019 Other       1253
11  2019 Vuse         100
12  2019 <NA>       15414

The above helps us remember that we only have brand information from 2019.

nyts_data |> 
  filter(year == 2019, !is.na(brand_ecig)) |> 
  ggplot(aes(x=brand_ecig)) +
  geom_bar()

Note: If including in a report, you’ll want titles, cleaner axis names, and likely to sort the x-axis values, but for your first pass EDA where you’re just trying to understand the data, this is sufficient.
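A report-ready version might look roughly like this. The counts are copied from the 2019 rows of the `count()` output above; the title and axis labels are only suggestions.

```r
library(ggplot2)

# 2019 brand counts, copied from the count() output shown earlier
brand_counts <- data.frame(
  brand_ecig = c("Blu", "JUUL", "Logic", "MarkTen", "NJOY", "Other", "Vuse"),
  n          = c(111, 2028, 36, 32, 44, 1253, 100)
)

# reorder() sorts the brands by count (largest first) on the x-axis
p <- ggplot(brand_counts, aes(x = reorder(brand_ecig, -n), y = n)) +
  geom_col() +
  labs(title = "E-cigarette brands reported in 2019",
       x = "Brand",
       y = "Number of respondents")
p
```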

Flavors

Student suggestion: Distribution of most popular flavors (all flavors)

This one is more complicated b/c flavor data are across multiple columns…going to leave this for the analysis set of notes
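In the meantime, here is a rough sketch of one way the multi-column problem could be handled, and not necessarily the approach the analysis notes will take: pivot the flavor columns to long format, then count the TRUE values. The tibble `toy` is made up; only the column names (`menthol`, `fruit`, `other`) are borrowed from the skim output.

```r
library(dplyr)
library(tidyr)

# Toy stand-in with three of the logical flavor columns (made-up values)
toy <- tibble(
  menthol = c(TRUE, FALSE, FALSE, NA),
  fruit   = c(TRUE, TRUE, FALSE, NA),
  other   = c(FALSE, FALSE, FALSE, NA)
)

# One row per respondent-flavor pair, then keep and count the TRUE rows
flavor_counts <- toy |>
  pivot_longer(everything(), names_to = "flavor", values_to = "used") |>
  filter(used) |>                 # drops FALSE and NA
  count(flavor, sort = TRUE)
flavor_counts
```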

Tobacco use across time

Student suggestion: year on the x-axis; y is frequency of use; some measurement of use (e-cig, tobacco, …)

nyts_data |>
  group_by(year) |>
  # proportion of respondents reporting ever-use, per year
  summarize(mean_ecig_ever = mean(ecig_ever, na.rm=TRUE),
            mean_tobacco_ever = mean(tobacco_ever, na.rm=TRUE)) |>
  pivot_longer(-year, names_to = "variable", values_to = "values") |>
  ggplot(aes(x=year, y=values, group=variable, linetype=variable)) +
  geom_line()

Suggested Reading